
(CVPR 2018) Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network

Zhang Z, Xie Y, Yang L. Photographic text-to-image synthesis with a hierarchically-nested adversarial network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6199-6208.



1. Overview


1.1. Motivation

  • a fully end-to-end mapping from the low-dimensional text space to the high-resolution image space remains unsolved
  • two difficulties
    • balancing the convergence between the generator G and discriminator D
    • stably modeling the huge pixel space of high-resolution images while guaranteeing semantic consistency


This paper proposes HDGAN, an extensible single-stream generator architecture

  • hierarchically-nested discriminators. regularize mid-level representations and assist generator training to capture complex image statistics
  • multi-purpose adversarial loss
  • new visual-semantic similarity measure

  • single-stage training (no multi-stage pipeline)

  • no conditioning on multiple text inputs
  • no additional class-label supervision

1.2. Dataset

  • CUB birds
  • Oxford-102 flowers
  • MSCOCO

1.3. Related Work

1.3.1. Generative Models

  • GAN
  • VAE

1.3.2. Text-to-Image

  • (ICML 2016) GAN
  • (NIPS 2016) GAN what-where network
  • (ICCV 2017) StackGAN
  • (ICCV 2017) joint embedding
  • perceptual loss
  • auxiliary classifier
  • attention-driven

1.3.3. Stability of GAN

  • training techniques
  • regularization using extra knowledge
  • combination of G and D

As the target image resolution increases, the training difficulty increases.

1.3.4. Decompose into Multiple Subtasks

  • LAP-GAN
  • symmetric G and D
  • stage-by-stage



2. Methods




2.1. Hierarchical-nested Adversarial Objective



  • G. hierarchical generator
  • z. noise vector
  • t. sentence embedding from a pre-trained char-CNN-RNN text encoder
  • s. number of scales
  • X_1 ... X_s. generated images at gradually increasing resolutions (the nested objective is sketched below)

  • lower resolutions. learn semantically consistent image structure

  • higher resolutions. render fine-grained details
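
The objective figure did not survive extraction. As a hedged reconstruction (the exact terms and weights in the paper may differ), the generator emits one image per scale and each scale is judged by its own discriminator, giving a nested min-max objective of the standard GAN form:

\[
\{X_1, \dots, X_s\} = G(z, t), \qquad
\min_{G} \max_{D_1,\dots,D_s} \; \sum_{i=1}^{s} \Big( \mathbb{E}_{X_i \sim p_{\mathrm{data}}}\big[\log D_i(X_i)\big] + \mathbb{E}_{\hat{X}_i \sim p_G}\big[\log\big(1 - D_i(\hat{X}_i)\big)\big] \Big)
\]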

2.2. Multi-purpose Adversarial Loss



  • pair loss (single scalar). guarantees global semantic consistency between image and text
  • image loss (R_i x R_i probability map). low-resolution discriminators focus on global structure, high-resolution ones on local image details
  • two types of errors treated as fake for the pair loss
    • real image + mismatched text
    • fake image + conditioned (matched) text

2.2.1. D
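
The loss figure is missing here; below is a hedged sketch of the per-scale discriminator objective consistent with Sec. 2.2 (matched real pairs pushed to 1; unconditional fakes, mismatched-text pairs, and fake-image pairs pushed to 0). The paper's exact formulation may differ:

\[
\mathcal{L}_{D_i} = -\,\mathbb{E}\big[\log D_i(X_i)\big] - \mathbb{E}\big[\log(1 - D_i(\hat{X}_i))\big]
- \mathbb{E}\big[\log D_i(X_i, t)\big] - \mathbb{E}\big[\log(1 - D_i(X_i, \bar{t}))\big] - \mathbb{E}\big[\log(1 - D_i(\hat{X}_i, t))\big]
\]

where X_i is a real image at scale i, \hat{X}_i a generated one, t the matched sentence embedding, and \bar{t} a mismatched one.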



2.2.2. G
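
Likewise, a hedged sketch of the generator objective (the non-saturating form and the weight λ on the KL term from Sec. 2.2.3 are assumptions): the generator tries to make every D_i accept both the image alone and the image-text pair:

\[
\mathcal{L}_{G} = -\sum_{i=1}^{s} \Big( \mathbb{E}_{z,t}\big[\log D_i(\hat{X}_i)\big] + \mathbb{E}_{z,t}\big[\log D_i(\hat{X}_i, t)\big] \Big) + \lambda\, \mathcal{L}_{\mathrm{KL}}
\]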



2.2.3. Conditioning Augmentation

Instead of directly using the deterministic text embedding, a stochastic conditioning vector is sampled from a Gaussian distribution whose mean and covariance are functions of the text embedding



A Kullback-Leibler divergence regularization term (against the standard Gaussian) is added to prevent over-fitting and enforce smoothness of the conditioning manifold
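
A minimal PyTorch-style sketch of conditioning augmentation as described above (class name, dimensions, and layer choice are illustrative assumptions, not the authors' code):

    import torch
    import torch.nn as nn

    class CondAugmentation(nn.Module):
        """Map a deterministic sentence embedding to a stochastic conditioning vector."""
        def __init__(self, text_dim=1024, cond_dim=128):
            super().__init__()
            # one linear layer predicts both the mean and the log-variance of the Gaussian
            self.fc = nn.Linear(text_dim, cond_dim * 2)

        def forward(self, t):
            mu, logvar = self.fc(t).chunk(2, dim=1)
            # reparameterization: c = mu + sigma * eps with eps ~ N(0, I)
            c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            # KL(N(mu, sigma^2) || N(0, I)), averaged over batch and dimensions
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return c, kl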



2.3. Architecture

2.3.1. G

  • three modules (see the sketch below)
    • K-repeat ResBlock. two Conv + BN-ReLU layers per unit
    • stretching layer. x2 nearest-neighbor upsampling + Conv-BN-ReLU
    • linear compression layer. Conv + Tanh, producing the RGB side output at each scale

structure: 1-2-1-2-... (ResBlocks alternating with stretching layers)

  • text embedding from conditioning augmentation (CA). reshaped to 1024 x 4 x 4
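
A rough PyTorch-style sketch of the three generator modules listed above (kernel sizes and the placement of the residual ReLU are assumptions for illustration):

    import torch.nn as nn

    class ResBlock(nn.Module):
        """One unit of the K-repeat ResBlock: two Conv + BN-ReLU layers with a skip connection."""
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(x + self.body(x))

    def stretching_layer(in_ch, out_ch):
        """Doubles spatial resolution: x2 nearest-neighbor upsample + Conv-BN-ReLU."""
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def linear_compression(in_ch):
        """Projects a feature map to an RGB side output at the current scale: Conv + Tanh."""
        return nn.Sequential(nn.Conv2d(in_ch, 3, 3, padding=1), nn.Tanh())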

2.3.2. D

  • stack of stride-2 Conv + BN-LeakyReLU layers
  • two branches follow (see the sketch below)
    • image branch. a fully convolutional head produces the R_i x R_i probability map for the local image loss
    • pair branch. the 512 x 4 x 4 feature map is concatenated with the (dimension-reduced) 128 x 4 x 4 text embedding, followed by a 1x1 Conv and a 4x4 Conv to produce the pair score
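
A hedged PyTorch-style sketch of the two discriminator branches on top of the shared stride-2 Conv backbone (channel counts follow the notes above; names and exact layer ordering are assumptions):

    import torch
    import torch.nn as nn

    class DiscriminatorHeads(nn.Module):
        """Two branches over shared features: a local R_i x R_i map and an image-text pair score."""
        def __init__(self, feat_ch=512, local_ch=256, text_dim=1024, reduced_dim=128):
            super().__init__()
            # image branch: fully convolutional 1x1 head -> R_i x R_i real/fake logits
            self.image_head = nn.Conv2d(local_ch, 1, kernel_size=1)
            # pair branch: reduce the sentence embedding, tile it to 4x4, then 1x1 Conv + 4x4 Conv
            self.reduce_text = nn.Linear(text_dim, reduced_dim)
            self.pair_head = nn.Sequential(
                nn.Conv2d(feat_ch + reduced_dim, feat_ch, kernel_size=1),
                nn.BatchNorm2d(feat_ch), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(feat_ch, 1, kernel_size=4),  # 4x4 conv collapses the 4x4 map to a scalar
            )

        def forward(self, deep_feat, local_feat, t):
            # deep_feat: B x 512 x 4 x 4, local_feat: B x local_ch x R_i x R_i, t: B x text_dim
            image_logits = self.image_head(local_feat)
            txt = self.reduce_text(t).unsqueeze(-1).unsqueeze(-1)
            txt = txt.expand(-1, -1, deep_feat.size(2), deep_feat.size(3))
            pair_logit = self.pair_head(torch.cat([deep_feat, txt], dim=1))
            return image_logits, pair_logit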



3. Experiments


3.1. Metrics

  • Inception Score. requires a pre-trained Inception model (fine-tuned on each dataset used in the paper)
  • Multi-scale Structural Similarity (MS-SSIM). pairwise similarity; a lower score indicates higher diversity of generated images
  • Visual-semantic Similarity. train a visual-semantic embedding model to measure the distance between text and image (see the sketch below)


  • δ. ranking-loss margin, set to 0.2
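
The embedding objective did not survive extraction; a hedged reconstruction of the usual bi-directional ranking loss with margin δ (the paper's exact form may differ), where c(·,·) is cosine similarity between the embedded image v and text t, and v^-, t^- denote mismatched samples:

\[
\mathcal{L} = \sum \max\big(0,\; \delta - c(v, t) + c(v, t^{-})\big) + \sum \max\big(0,\; \delta - c(v, t) + c(v^{-}, t)\big)
\]

The visual-semantic similarity between a generated image and its conditioning text can then be read off as c(v, t) in this learned embedding.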

3.2. Comparison




  • better preserves semantically consistent information at all resolutions


3.3. Style Transfer



3.4. Ablation Study





  • the local image loss improves performance and lets the pair loss focus more on learning semantic consistency

3.5. Sharing Top Layers of Discriminators

  • no benefits were observed